NSF PAR Search | NSF Public Access Repository

An End-to-end High-performance Deduplication Scheme for Docker Registries and Docker Container Storage Systems

https://doi.org/10.1145/3643819

Zhao, Nannan; Lin, Muhui; Albahar, Hadeel; Paul, Arnab K; Huan, Zhijie; Abraham, Subil; Chen, Keren; Tarasov, Vasily; Skourtis, Dimitrios; Anwar, Ali; et al (August 2024, ACM Transactions on Storage)

The wide adoption of Docker containers for supporting agile and elastic enterprise applications has led to a broad proliferation of container images. The associated storage performance and capacity requirements place high pressure on the infrastructure of container registries that store and distribute images and container storage systems on the Docker client side that manage image layers and store ephemeral data generated at container runtime. The storage demand is worsened by the large amount of duplicate data in images. Moreover, container storage systems that use Copy-on-Write (CoW) file systems as storage drivers exacerbate the redundancy. Exploiting the high file redundancy in real-world images is a promising approach to drastically reduce the growing storage requirements of container registries and improve the space efficiency of container storage systems. However, existing deduplication techniques significantly degrade the performance of both registries and container storage systems because of data reconstruction overhead as well as the deduplication cost. We propose DupHunter, an end-to-end deduplication scheme that deduplicates layers for both Docker registries and container storage systems while maintaining a high image distribution speed and container I/O performance. DupHunter is divided into three tiers: registry tier, middle tier, and client tier. Specifically, we first build a high-performance deduplication engine at the registry tier that not only natively deduplicates layers for space savings but also reduces layer restore overhead. Then, we use deduplication offloading at the middle tier to eliminate the redundant files from the client tier and avoid bringing deduplication overhead to the clients. To further reduce the data duplicates caused by CoWs and improve the container I/O performance, we utilize a container-aware storage system at the client tier that reserves space for each container and arranges the placement of files and their modifications on the disk to preserve locality. Under real workloads, DupHunter reduces storage space by up to 6.9× and reduces the GET layer latency by up to 2.8× compared to the state-of-the-art. Moreover, DupHunter can improve the container I/O performance by up to 93% for reads and 64% for writes.
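The file-level deduplication idea in this abstract can be illustrated with a minimal sketch. It is not DupHunter's implementation: the class `FileDedupStore`, its methods, and the in-memory dictionaries are hypothetical names chosen for illustration, and a layer is assumed to be a simple mapping from file path to file bytes. The sketch stores each unique file once (keyed by its SHA-256 digest) and keeps a per-layer recipe, so serving a layer requires rebuilding it from that recipe, which is the restore overhead the abstract refers to.

```python
import hashlib


class FileDedupStore:
    """Toy content-addressed store: each unique file is kept once (hypothetical)."""

    def __init__(self):
        self.blobs = {}    # sha256 hex digest -> file bytes
        self.recipes = {}  # layer id -> {path: digest}

    def put_layer(self, layer_id, files):
        """Store a layer given as {path: bytes}; return the bytes newly written."""
        written = 0
        recipe = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            if digest not in self.blobs:
                # First time this content is seen: store it.
                self.blobs[digest] = data
                written += len(data)
            recipe[path] = digest
        self.recipes[layer_id] = recipe
        return written

    def get_layer(self, layer_id):
        """Rebuild a layer from its recipe (the restore step)."""
        recipe = self.recipes[layer_id]
        return {path: self.blobs[digest] for path, digest in recipe.items()}

    def stored_bytes(self):
        """Total bytes held after deduplication."""
        return sum(len(blob) for blob in self.blobs.values())
```

In this sketch, two layers that share a file pay for its bytes only once in `put_layer`, while `get_layer` still returns each layer in full.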
Full Text Available
Balancing Costs and Durability for Serverless Data

Merenstein, Alex; Wang, Xinran; Tarasov, Vasily; Agarwal, Prajjawal; Guthridge, Scott; Thakkar, Kapil; Wu, Katherine; Anwar, Ali; Zadok, Erez (June 2024, IEEE)

Durability features such as replication or erasure coding serve an important role in storage systems, enabling users to store data without fear of loss due to device failures. However, these durability features come with a cost, in terms of storage, network traffic, and computational overheads. For most data, loss is a catastrophic event and so these overheads are acceptable. However, some data tolerates low durability and does not need the high level of durability that most storage systems provide. Identifying the proper level of durability for a piece of data is difficult, especially since it is often not clear how to determine the cost of loss. For some data used in serverless applications, however, this cost is relatively straightforward to calculate: serverless functions are often required to be idempotent, meaning that the data produced by them can be re-created by re-running the function. The cost of losing a piece of data then is merely the cost of re-running the function that originally created the data. In this paper, we explore the tradeoff between the cost of storing data durably and the cost to re-create data. We focus on serverless data because its ability to be recreated makes it possible to assign a cost to its loss. We develop a mathematical model that relates compute costs, storage costs, and application-specific parameters to calculate the cost-optimal placement of data. We also develop an execution framework capable of handling lost data transparently, enabling applications to use lower-durability storage with no additional burden on the developer. Next, we show how different factors such as failure rate and compute costs affect the placement decision. We find that thanks to the relatively short lifetime of serverless data, the probability of data loss even on low-durability storage is fairly low. Finally, we use the model to place data for several applications, including a video-transcoding application and an image-assembly application. We show that our model can predict execution costs within 7% of actual execution costs, and can reduce storage costs by up to 3x while never exceeding baseline costs.
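The tradeoff described in this abstract (pay for durable storage, or risk paying to re-run the function) can be sketched as a simple expected-cost comparison. This is not the paper's model: the `Tier` fields, the prices, and the exponential time-to-loss assumption are all illustrative assumptions, and the sketch counts at most one re-computation per object.

```python
import math
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    price_per_gb_hour: float  # hypothetical storage price
    failures_per_hour: float  # hypothetical rate at which data on this tier is lost


def loss_probability(tier, lifetime_hours):
    # Assumption: time to loss is exponentially distributed.
    return 1.0 - math.exp(-tier.failures_per_hour * lifetime_hours)


def expected_cost(tier, size_gb, lifetime_hours, recompute_cost):
    # Storage cost plus the expected cost of re-running the producing function.
    storage = tier.price_per_gb_hour * size_gb * lifetime_hours
    return storage + loss_probability(tier, lifetime_hours) * recompute_cost


def cheapest_tier(tiers, size_gb, lifetime_hours, recompute_cost):
    # Pick the tier with the lowest expected total cost.
    return min(
        tiers,
        key=lambda t: expected_cost(t, size_gb, lifetime_hours, recompute_cost),
    )
```

Under these assumptions, short-lived data with a cheap producing function lands on the low-durability tier, while data that is expensive to re-create moves to the durable tier.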
Full Text Available
F3: Serving Files Efficiently in Serverless Computing

https://doi.org/10.1145/3579370.3594771

Merenstein, Alex; Tarasov, Vasily; Anwar, Ali; Guthridge, Scott; Zadok, Erez (June 2023, The 16th ACM International Systems and Storage Conference (SYSTOR '23))

Serverless platforms offer on-demand computation and represent a significant shift from previous platforms that typically required resources to be pre-allocated (e.g., virtual machines). As serverless platforms have evolved, they have become suitable for a much wider range of applications than their original use cases. However, storage access remains a pain point that holds serverless back from becoming a completely generic computation platform. Existing storage for serverless typically uses an object interface. Although object APIs are simple to use, they lack the richness, versatility, and performance of file-based APIs. Additionally, there is a large body of existing applications that relies on file-based interfaces. The lack of file-based storage options prevents these applications from being ported to serverless environments. In this paper, we present F3, a file system that offers features to improve file access in serverless platforms: (1) efficient handling of ephemeral data, by placing ephemeral and non-ephemeral data on storage that exists at different points along the durability-performance tradeoff continuum, (2) locality-aware data scheduling, and (3) efficient reading while writing. We modified OpenWhisk to support attaching file-based storage and to leverage F3's features using hints. Our prototype evaluation of F3 shows improved performance of up to 1.5–6.5x compared to existing storage systems.
Full Text Available
SION: Elastic Serverless Cloud Storage

Zhang, Jingyuan; Wang, Ao; Ma, Xiaolong; Carver, Benjamin; Newman, Nicholas; Anwar, Ali; Rupprecht, Lukas; Skourtis, Dimitrios; Tarasov, Vasily; Yan, Feng; et al (August 2023, International Conference on Very Large Data Bases (VLDB 2023))

Full Text Available
InfiniStore: Elastic Serverless Cloud Storage

https://doi.org/10.14778/3587136.3587139

Zhang, Jingyuan; Wang, Ao; Ma, Xiaolong; Carver, Benjamin; Newman, Nicholas John; Anwar, Ali; Rupprecht, Lukas; Tarasov, Vasily; Skourtis, Dimitrios; Yan, Feng; et al (March 2023, Proceedings of the VLDB Endowment)

Cloud object storage such as AWS S3 is cost-effective and highly elastic but relatively slow, while high-performance cloud storage such as AWS ElastiCache is expensive and provides limited elasticity. We present a new cloud storage service called ServerlessMemory, which stores data using the memory of serverless functions. ServerlessMemory employs a sliding-window-based memory management strategy inspired by the garbage collection mechanisms used in programming languages to effectively segregate hot/cold data and provides fine-grained elasticity, good performance, and a pay-per-access cost model with extremely low cost. We then design and implement InfiniStore, a persistent and elastic cloud storage system, which seamlessly couples the function-based ServerlessMemory layer with a persistent, inexpensive cloud object store layer. InfiniStore enables durability despite function failures using a fast parallel recovery scheme built on the auto-scaling functionality of a FaaS (Function-as-a-Service) platform. We evaluate InfiniStore extensively using both microbenchmarking and two real-world applications. Results show that InfiniStore has more performance benefits for objects larger than 10 MB compared to AWS ElastiCache and Anna, and InfiniStore achieves 26.25% and 97.24% tenant-side cost reduction compared to InfiniCache and ElastiCache, respectively.
Full Text Available
DupHunter: Flexible High-Performance Deduplication for Docker Registries

Zhao, Nannan; Albahar, Hadeel; Abraham, Subil; Chen, Keren; Tarasov, Vasily; Skourtis, Dimitrios; Rupprecht, Lukas; Anwar, Ali; Butt, Ali R. (July 2020, USENIX Annual Technical Conference (ATC'20))

Full Text Available
InfiniCache: Exploiting Ephemeral Serverless Functions to Build a Cost-Effective Memory Cache

Wang, Ao; Zhang, Jingyuan; Ma, Xiaolong; Anwar, Ali; Rupprecht, Lukas; Skourtis, Dimitrios; Tarasov, Vasily; Yan, Feng; Cheng, Yue (February 2020, 18th USENIX Conference on File and Storage Technologies)

Internet-scale web applications are becoming increasingly storage-intensive and rely heavily on in-memory object caching to attain required I/O performance. We argue that the emerging serverless computing paradigm provides a well-suited, cost-effective platform for object caching. We present InfiniCache, a first-of-its-kind in-memory object caching system that is completely built and deployed atop ephemeral serverless functions. InfiniCache exploits and orchestrates serverless functions' memory resources to enable elastic pay-per-use caching. InfiniCache's design combines erasure coding, intelligent billed duration control, and an efficient data backup mechanism to maximize data availability and cost-effectiveness while balancing the risk of losing cached state and performance. We implement InfiniCache on AWS Lambda and show that it: (1) achieves 31–96× tenant-side cost savings compared to AWS ElastiCache for a large-object-only production workload, (2) can effectively provide 95.4% data availability for each one-hour window, and (3) achieves performance comparable to that of a typical in-memory cache.
Full Text Available
InfiniCache: Exploiting Ephemeral Serverless Functions to Build a Cost-Effective Memory Cache

Wang, Ao; Zhang, Jingyuan; Ma, Xiaolong; Anwar, Ali; Rupprecht, Lukas; Skourtis, Dimitrios; Tarasov, Vasily; Yan, Feng; Cheng, Yue (February 2020, The 18th USENIX Conference on File and Storage Technologies (FAST 20))

Full Text Available
INFINICACHE: Exploiting Ephemeral Serverless Functions to Build a Cost-Effective Memory Cache

Wang, Ao; Zhang, Jingyuan; Ma, Xiaolong; Anwar, Ali; Rupprecht, Lukas; Skourtis, Dimitrios; Tarasov, Vasily; Yan, Feng; Cheng, Yue (February 2020, Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST ’20))

Full Text Available
Data Storage Research Vision 2025 Report

Amvrosiadis, George; Butt, Ali R.; Tarasov, Vasily; Zadok, Erez; Zhao, Ming (April 2019, Technical Report)

Full Text Available
